A Fast and Accurate Vietnamese Word Segmenter
نویسندگان
چکیده
We propose a novel approach to Vietnamese word segmentation. Our approach is based on the Single Classification Ripple Down Rules methodology (Compton and Jansen, 1990), where rules are stored in an exception structure and new rules are only added to correct segmentation errors given by existing rules. Experimental results on the benchmark Vietnamese treebank show that our approach outperforms previous state-of-the-art approaches JVnSegmenter, vnTokenizer, DongDu and UETsegmenter in terms of both accuracy and performance speed. Our code is open-source and available at: https://github.com/datquocnguyen/RDRsegmenter.
منابع مشابه
Fast and Accurate Neural Word Segmentation for Chinese
Neural models with minimal feature engineering have achieved competitive performance against traditional methods for the task of Chinese word segmentation. However, both training and working procedures of the current neural models are computationally inefficient. This paper presents a greedy neural word segmenter with balanced word and character embedding inputs to alleviate the existing drawba...
متن کاملFarasa: A New Fast and Accurate Arabic Word Segmenter
In this paper, we present Farasa (meaning insight in Arabic), which is a fast and accurate Arabic segmenter. Segmentation involves breaking Arabic words into their constituent clitics. Our approach is based on SVM using linear kernels. The features that we utilized account for: likelihood of stems, prefixes, suffixes, and their combination; presence in lexicons containing valid stems and named ...
متن کاملDeterministic Word Segmentation Using Maximum Matching with Fully Lexicalized Rules
We present a fast algorithm of word segmentation that scans an input sentence in a deterministic manner just one time. The algorithm is based on simple maximum matching which includes execution of fully lexicalized transformational rules. Since the process of rule matching is incorporated into dictionary lookup, fast segmentation is achieved. We evaluated the proposed method on word segmentatio...
متن کاملFrom Word Segmentation to POS Tagging for Vietnamese
This paper presents an empirical comparison of two strategies for Vietnamese Part-of-Speech (POS) tagging from unsegmented text: (i) a pipeline strategy where we consider the output of a word segmenter as the input of a POS tagger, and (ii) a joint strategy where we predict a combined segmentation and POS tag for each syllable. We also make a comparison between state-of-the-art (SOTA) featureba...
متن کاملFarasa: A Fast and Furious Segmenter for Arabic
In this paper, we present Farasa, a fast and accurate Arabic segmenter. Our approach is based on SVM-rank using linear kernels. We measure the performance of the segmenter in terms of accuracy and efficiency, in two NLP tasks, namely Machine Translation (MT) and Information Retrieval (IR). Farasa outperforms or is at par with the stateof-the-art Arabic segmenters (Stanford and MADAMIRA), while ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1709.06307 شماره
صفحات -
تاریخ انتشار 2017